Project: Explore Braun dataset¶

Upstream Steps

  • Assembled adata
  • Filtered to cortex, cerebellum, and thalamus
  • Sub-sampled to 75,000 cells

This notebook

  • QC filter on cells
  • Expression filter on genes
  • Normalization and log transformation by Scanpy functions
  • Feature selection (HVG) by Scanpy functions
  • Dimensionality reduction
  • Batch correction by Harmony

Dataset Information¶

References¶

  • Paper Reference: Comprehensive cell atlas of the first-trimester developing human brain
  • Data and code repository: GitHub Repo

Short description: 26 brain specimens spanning 5 to 14 postconceptional weeks (pcw) were dissected into 111 distinct biological samples. Each sample was subjected to single-cell RNA sequencing, resulting in 1,665,937 high-quality single-cell transcriptomes.

Subsample of the dataset selecting cortex, cerebellum and thalamus

Cell populations identified by the authors:
  • Erythrocyte
  • Fibroblast
  • Glioblast
  • Immune
  • Neural crest
  • Neuroblast
  • Neuron
  • Neuronal IPC
  • Oligo
  • Placodes
  • Radial glia
  • Vascular

1. Environment¶

In [1]:
import os
import sys
import itertools
import random
import warnings

import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import scipy.sparse as sp
from scipy import stats
import statsmodels.api as sm
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio

import yaml

warnings.filterwarnings('ignore')
pio.renderers.default = "jupyterlab"

random.seed(1)
In [2]:
homeDir = os.getenv("HOME")

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()


# import user-defined functions from the utils folder
sys.path.insert(1, "../2_Day2/utils")

from CleanAdata import *
from SankeyOBS import *
scanpy==1.10.1 anndata==0.10.7 umap==0.5.6 numpy==1.26.4 scipy==1.11.4 pandas==2.2.2 scikit-learn==1.4.2 statsmodels==0.14.2 igraph==0.11.5 pynndescent==0.5.12

1.3 Input file¶

You can set parameters useful for your analysis in the cells below and then recall them during the analysis

In [3]:
path = '../DataDir/InputData/'
Id = 'Project_FiltNormAdata.h5ad'
input_file = path + Id

1. Data Load¶

In [4]:
adata = sc.read_h5ad(input_file)
adata.var_names_make_unique()

2. Additional QCs¶

Here we perform a few additional inspections to ensure that technical issues are not carried over to the next steps.

First we compute again the quality metrics:

In [5]:
# mitochondrial genes, "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")

sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
)
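Under the hood, the percent metrics added by `calculate_qc_metrics` are simple ratios: counts falling in a gene subset divided by the cell's total counts. A minimal numpy sketch of `pct_counts_mt` on a toy matrix (gene names here are only illustrative):

```python
import numpy as np

# toy counts matrix: 3 cells x 4 genes, of which 2 are mitochondrial
genes = np.array(["MT-CO1", "MT-ND1", "GAPDH", "ACTB"])
counts = np.array([
    [5, 5, 40, 50],   # cell 1: 10 mito counts out of 100
    [1, 0, 10, 9],    # cell 2: 1 mito count out of 20
    [0, 0, 30, 20],   # cell 3: no mito counts
])

# boolean gene mask, same idea as adata.var["mt"] above
mt_mask = np.char.startswith(genes, "MT-")
total_counts = counts.sum(axis=1)
pct_counts_mt = counts[:, mt_mask].sum(axis=1) / total_counts * 100
```

The real function also computes log1p-transformed totals and per-gene summaries, but the per-cell percentages reduce to this masked sum.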
In [6]:
# Format better Donor and Week columns
adata.obs["Og_Donor"] = adata.obs["Og_Donor"].str.replace(":","")
adata.obs["Og_Week"] = adata.obs["Og_Week"].astype("string")
In [7]:
sc.pl.pca(adata, color=["Og_Donor", "Og_Week","Og_Chemistry","Og_Subregion"], ncols=4, wspace=.5, size = 50, vmin='p1', vmax='p99')

2.4 Batches inspection¶

In [8]:
adata.obs
Out[8]:
Og_Subregion Og_Tissue Og_TopLevelCluster_x Og_dissection Og_Donor Og_Age Og_Week Og_Chemistry run_id sample_id ... pct_counts_in_top_50_genes pct_counts_in_top_100_genes pct_counts_in_top_200_genes pct_counts_in_top_500_genes total_counts_mt log1p_total_counts_mt pct_counts_mt total_counts_hb log1p_total_counts_hb pct_counts_hb
CellID_Index
b'10X208_6:ACGGTCGCAGTCAGTT' Cortex Occipital_cortex 2 Occipital_cortex XDD358 11.5 11 v3 NoInfo 10X208_6 ... 4.429574 8.076678 13.885225 26.650324 22.181249 3.143344 0.541272 1.102416 0.743087 0.026901
b'10X212_6:GTGCTGGTCTTTCAGT' Cortex Lower_cortex 0 Lower_cortex XDD359 13.0 13 v3 NoInfo 10X212_6 ... 4.607994 8.334671 14.549377 28.904248 25.100336 3.261948 0.646785 1.425516 0.886044 0.036733
b'10X89_6:ACTTTCACAAACTGCT' Thalamus Thalamus 13 Thalamus BRC2006 8.0 8 v2 NoInfo 10X89_6 ... 6.697738 12.189211 21.691530 44.923577 23.801810 3.210917 0.891311 1.789264 1.025778 0.067003
b'10X199_8:CTAAGTGAGCTATCTG' Thalamus Thalamus 13 Thalamus BRC2191 6.9 6 v3 NoInfo 10X199_8 ... 4.537666 8.189201 14.390066 29.213019 29.283055 3.410588 0.768188 0.933989 0.659585 0.024502
b'10X199_5:TACCTGCGTAAGATTG' Thalamus Thalamus 34 Thalamus BRC2191 6.9 6 v3 NoInfo 10X199_5 ... 5.401125 9.620665 16.112945 30.466863 31.569998 3.483392 0.877779 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
b'10X156_1:CAAGAAAGTCTAGTGT' Cerebellum Cerebellum 9 Cerebellum XDD334 8.0 8 v2 NoInfo 10X156_1 ... 5.977457 10.746987 18.915256 39.211040 11.394504 2.517253 0.376367 0.000000 0.000000 0.000000
b'10X254_7:CACTAAGTCGTCAGAT' Thalamus Thalamus 38 Thalamus XDD385 14.0 14 v3 NoInfo 10X254_7 ... 4.178680 7.443501 13.039368 26.484720 33.267573 3.534200 0.827652 1.797263 1.028642 0.044713
b'10X102_4:GCGCAGTCATTTCACT' Cortex Cortex 28 Cortex XHU297 10.0 10 v2 NoInfo 10X102_4 ... 6.114603 10.842398 17.966865 34.507028 21.706862 3.122667 0.670763 1.065353 0.725301 0.032920
b'10X104_6:GGCAATTCAATACGCT' Thalamus Thalamus 13 Thalamus BRC2057 8.1 8 v2 NoInfo 10X104_6 ... 5.582894 9.874799 17.299339 36.175986 18.719663 2.981616 0.589286 0.000000 0.000000 0.000000
b'10X109_3:AACTTTCTCCAGAAGG' Cerebellum Cerebellum 32 Cerebellum BRC2073 6.6 6 v2 NoInfo 10X109_3 ... 6.175726 11.201448 19.565389 40.368541 15.931441 2.829172 0.535812 1.454401 0.897883 0.048915

66567 rows × 35 columns

In [9]:
plotSankey(adata, covs=["Og_Donor","Og_Subregion","Og_Chemistry"])

2.2. Multiplets removal via scDblFinder¶


In [10]:
import logging

import anndata2ri
import rpy2.rinterface_lib.callbacks
import rpy2.robjects.packages
from rpy2 import robjects

anndata2ri.activate()
%load_ext rpy2.ipython
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Point rpy2 at the local R installation
# R_HOME must be the R home directory, not the path to the R executable
os.environ['R_HOME'] = '/opt/R/4.3.1/lib/R'
os.environ['R_LIBS_USER'] = '/opt/R/4.3.1/lib/R/library'
custom_lib_paths = "/opt/R/4.3.1/lib/R/library"
robjects.r(f'.libPaths(c("{custom_lib_paths}", .libPaths()))')


if os.path.isfile("./utils/DoubletCLass.tsv"):
    # The class column is merged into adata.obs in the next cell
    print("Loading precomputed dbls class")
    dblfinderClass = pd.read_csv("./utils/DoubletCLass.tsv", sep="\t", index_col=0)
else:
    dblfinderClass = pd.DataFrame()
    for sample in adata.obs["sample_id"].unique():
        print(sample)
        # Clean the per-sample anndata from unnecessary fields before conversion to R
        sce = adata[adata.obs["sample_id"] == sample].copy()
        sce.obs = sce.obs[["sample_id"]]
        del sce.obsp
        sce.layers["counts"] = sce.layers["counts"].astype(np.int32).todense()
        sce.X = sce.layers["counts"].copy()
        del sce.obsm
        del sce.varm
        del sce.uns
        sce.var.index.name = None
        sce.var = sce.var[["EnsembleCode"]]

        # Run doublet detection with scDblFinder in R
        sce = anndata2ri.py2rpy(sce)
        print(sce)
        scDblFinder = rpy2.robjects.packages.importr('scDblFinder')
        sce = scDblFinder.scDblFinder(sce)
        dblfinderClass = pd.concat([dblfinderClass, sce.obs[["scDblFinder.class"]]])
        del sce
    # Save the doublets class column for reuse
    dblfinderClass["scDblFinder.class"].to_csv("./utils/DoubletCLass.tsv", sep="\t")
Loading precomputed dbls class
In [11]:
adata.obs = pd.concat([adata.obs, dblfinderClass], axis = 1).loc[adata.obs.index]
adata = adata[adata.obs["scDblFinder.class"] == "singlet"]
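Before discarding doublets, it can be useful to check the doublet fraction per sample, since an unusually high fraction may flag a problematic library. A minimal pandas sketch on toy labels, using the same column names as this notebook (`sample_id`, `scDblFinder.class`):

```python
import pandas as pd

# toy obs table: two samples, four cells each
obs = pd.DataFrame({
    "sample_id": ["10X89_6"] * 4 + ["10X104_6"] * 4,
    "scDblFinder.class": ["singlet", "singlet", "doublet", "singlet",
                          "singlet", "doublet", "doublet", "singlet"],
})

# fraction of doublets per sample: mean of a boolean indicator
doublet_frac = (
    obs.assign(is_doublet=obs["scDblFinder.class"].eq("doublet"))
       .groupby("sample_id")["is_doublet"]
       .mean()
)
```

On the real object the same chain runs on `adata.obs` directly.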

2.3 Additional filtering?¶

For downstream analysis we want to be more conservative, so we inspect the relationship between the percentage of mitochondrial genes and the number of genes detected in each cell to identify outliers:

In [ ]:
 

⚠️❓ Is it worth filtering more this time?¶

⚠️❓ It could be a good idea to see how the clusters and the dimensionality reduction are driven by the quality metrics¶

⚠️❓ Try to explore how PCs are affected by metadata (covariates)¶

In [12]:
# refine filtering

sc.pl.scatter(adata, x="pct_counts_mt",y="n_genes_by_counts", size=40, color="pct_counts_ribo")
In [13]:
sc.pl.pca(adata, color=["Og_Donor","Og_Subregion","Og_Chemistry","cell_class","Og_Week"])

We can see that cells with a low number of detected genes have a higher percentage of mitochondrial and ribosomal counts, so we filter those out:

In [14]:
adata = adata[adata.obs["n_genes_by_counts"] >= 2000]
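The fixed 2,000-gene cutoff above is a manual choice for this dataset. A common, more adaptive alternative (not what this notebook uses) is to flag cells lying more than k median absolute deviations (MADs) from the median of a QC metric; the threshold k=3 below is an assumption, not taken from the paper:

```python
import numpy as np

def mad_outlier(values, k=3.0):
    """Flag values more than k MADs away from the median."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > k * mad

# toy n_genes_by_counts values with one extreme low-quality cell
n_genes = [2500, 2600, 2400, 2550, 300]
outliers = mad_outlier(n_genes, k=3.0)
```

On the real object this would be applied to `adata.obs["n_genes_by_counts"]` (and possibly `pct_counts_mt`), keeping cells where the flag is False.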

2.4.1 PCA regressors to check variance associated with covariates¶

A useful assessment is to understand how much of the variability in the PCA is explained by the covariates ("Og_Donor", "Og_Subregion", "Og_Chemistry", "cell_class", "Og_Week"). We can fit an Ordinary Least Squares regression on the principal component (PC) embeddings and plot the residuals for specific covariates. This helps us understand whether, for a given principal component and covariate, we observe differences across the covariate's values. It highlights covariates and batches that may impact the PC and answers the question: "How much do technical and biological factors guide the dimensionality reduction?"
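The core of this assessment reduces to: one-hot encode the covariate, regress the PC embedding on it, and record the adjusted R². A minimal sketch with plain numpy least squares, on synthetic data standing in for one PC and a 2-level covariate (all names here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic "PC" whose values shift with a 2-level covariate (e.g. a chemistry kit)
cov = np.array(["v2"] * 50 + ["v3"] * 50)
pc = rng.normal(0, 1, 100) + np.where(cov == "v3", 3.0, 0.0)

# one-hot design matrix with an intercept column
X = pd.get_dummies(cov, drop_first=True).to_numpy(dtype=float)
X = np.column_stack([np.ones(len(pc)), X])

# ordinary least squares fit and adjusted R^2
beta, *_ = np.linalg.lstsq(X, pc, rcond=None)
resid = pc - X @ beta
r2 = 1 - resid.var() / pc.var()
n, p = X.shape
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
```

A covariate with no effect on the PC would yield an adjusted R² near zero; a strong batch effect, as simulated here, yields a large one.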

In [15]:
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols


def plotResiduals(adata, covToTest, npcs):
    PCRegrDict = {}
    sns.set_theme(style="ticks")
    varianceSummary = pd.DataFrame()
    
    for i in range(npcs):
        
        for n,c in enumerate(covToTest):
            # format the data
            Dummies = pd.get_dummies(adata.obs[c])
            PCRegr = (Dummies.T * adata.obsm["X_pca"][:,i].T).T.melt()
            PCRegr = PCRegr[PCRegr["value"] != 0]
            PCRegr["variable"] = PCRegr["variable"].astype("category")
            PCRegr["cov"] = c
            PCRegrDict["PC:{}_COV:{}".format(i+1,c)] = PCRegr.copy()
            sns.despine(offset=30)
            PCRegrModel = pd.get_dummies(PCRegrDict["PC:{}_COV:{}".format(i+1,c)], "variable").copy()
            
            # define the regression formula
            formula = "".join(["value ~ "," + ".join(["C("+c+")" for c in PCRegrModel.columns[1:-1].tolist()])])
            
            # fit the regression
            fit = ols(formula, data=PCRegrModel).fit() 
            # fit.rsquared_adj
            
            # get the residuals
            PCRegr["rsquared_adj"] = fit.rsquared_adj
            PCRegr["PC"] = i
            varianceSummary = pd.concat([varianceSummary,PCRegr ], ignore_index=True)
            
    dedup = varianceSummary.drop_duplicates(["PC", "cov"])
    CovOrder = {
        pc: dedup[dedup["PC"] == pc].sort_values("rsquared_adj", ascending=False)["cov"].tolist()
        for pc in varianceSummary["PC"].unique().tolist()
    }
    
    for i in range(npcs):
        fig, axes = plt.subplots(1,len(covToTest), figsize=(40,4), dpi=300, sharex=False,sharey=False)
        plt.subplots_adjust(wspace=.5)
        
        for n,c in enumerate(CovOrder[i]):
            PCRegr = varianceSummary[(varianceSummary["PC"] == i) & (varianceSummary["cov"] == c)]
            sns.boxplot(data=PCRegr, x="variable", y="value", ax = axes[n],whis=[0, 100],
                        palette={i:adata.uns[c+"_colors"][adata.obs[c].cat.categories.tolist().index(i)] for i in PCRegr["variable"].unique().tolist()}
                        #order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()],
                        #hue_order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()])
                       )
            sns.stripplot(data=PCRegr, x="variable", y="value", size=4, color=".3",ax = axes[n], 
                          #order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()],
                          #hue_order=[i for i in PlotOrder[c] if i in PCRegr["variable"].unique().tolist()])
                         )
            axes[n].title.set_text('Covariate:{}'.format(c))
            axes[n].set_xlabel(c, fontsize=20)
            axes[n].set_ylabel("PC{} embeddings".format(i+1), fontsize=20)
            sns.despine(offset=30)
            rsq = PCRegr["rsquared_adj"].values[0]
            axes[n].text(0.5, 0, "rsquared_adj:{}".format(round(rsq, 2)), horizontalalignment='center', verticalalignment='center', transform=axes[n].transAxes)
            axes[n].tick_params(axis="x", rotation=45)
In [16]:
npcs = 6
covToTest = ["Og_Donor","Og_Subregion","Og_Chemistry","cell_class","Og_Week"]

plotResiduals(adata, covToTest, npcs)

3. This dataset is even larger, so metacells could again be a good idea¶

⚠️❓ How many metacells (k) would you generate this time?¶

⚠️❓ Feel free to play with the dataset isolating cell types and brain regions you are most interested in, or removing the ones that could drive the PCA¶

⚠️❓ If you decide to go with metacells, which grouping_obs would you use?¶

3.1 Data preparation¶

In [17]:
BadCelltypes = ["Immune_Cortex_Pre",
"Vascular_Cortex_Pre",
"Immune_Cerebellum_Pre",
"Vascular_Cerebellum_Pre",
"Vascular_Thalamus_Pre",
"Fibroblast_Cortex_Pre",
"Fibroblast_Cerebellum_Pre",
"Immune_Thalamus_Pre",
"Erythrocyte_Cortex_Pre"]
BadCelltypes = '|'.join(BadCelltypes)


adata = adata[~adata.obs.cell_class.str.contains(BadCelltypes)]

3.2 Metacells definition and aggregation¶

In [46]:
adata.obs["Donor_region"] = adata.obs["Og_Donor"].astype("str")+"_"+adata.obs["Og_Subregion"].astype("str")
In [ ]:
NewAdataList = []

import os
os.environ["OMP_NUM_THREADS"] = "4"
n_neighbors = 30
n_pcs = 8
n_top_genes = 2000
n_clusters = 200
grouping_obs = "Donor_region"


#######################################

if not os.path.exists("./utils/kmeansAggregation/"):
    os.makedirs("./utils/kmeansAggregation/")

for sample in adata.obs[grouping_obs].unique().tolist():
    print(sample)
    adataLocal = adata[adata.obs[grouping_obs] == sample].copy()
    if adataLocal.shape[0] <= 400:
        continue
    ncells = adataLocal.shape[0]
    filenameBackup = "./utils/kmeansAggregation/Kmeans{}.{}.{}topgenes.{}neighbs.{}pcs.{}Ks.csv".format(ncells,sample,n_top_genes,n_neighbors,n_pcs,n_clusters )
    
    # Check if the kmeans was precomputed
    if os.path.isfile(filenameBackup):
        print("Retrieving precomputed kmeans classes \n")
        KmeansOBS = pd.read_csv(filenameBackup, sep="\t", index_col=0)
        adataLocal.obs['kmeans_clusters'] = KmeansOBS
    else:
        #If not run k-means metacells aggregation
        sc.pp.highly_variable_genes(adataLocal, n_top_genes=n_top_genes, flavor="seurat")
        sc.tl.pca(adataLocal)
        # Compute the KNN graph (adjust n_neighbors as needed)
        sc.pp.neighbors(adataLocal, n_neighbors=n_neighbors, transformer='pynndescent', n_pcs=n_pcs, metric="euclidean")
        # Extract the KNN distance matrix from AnnData
        distances = adataLocal.obsp['distances'].toarray()

        # Apply K-Means clustering with a fixed number of clusters on the distance matrix
        print("Computing kmeans")
        adataLocal.obs['kmeans_clusters'] = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(distances)
        print("Storing kmeans classes \n")
        adataLocal.obs["kmeans_clusters"].to_csv(filenameBackup, sep="\t")

    
    adataLocal.X = adataLocal.layers["counts"].copy()

    # Normalize counts
    sc.pp.normalize_total(adataLocal, target_sum=2e4)

    # Create aggregated AnnData objects efficiently and add them to NewAdataList
    # One-hot encode cluster membership
    dummy_clusters = pd.get_dummies(adataLocal.obs['kmeans_clusters'])

    # The dot product sums the normalized expression values within each cluster
    # Each row in `cluster_aggregated_X` is the summed expression profile of one cluster
    X_dense = adataLocal.X.A if hasattr(adataLocal.X, "A") else adataLocal.X
    cluster_aggregated_X = dummy_clusters.T.dot(X_dense)

    cluster_aggregated_median_X = np.zeros((dummy_clusters.shape[1], X_dense.shape[1]))
    for cluster_idx in range(dummy_clusters.shape[1]):
        # Select cells belonging to the current cluster
        cluster_cells = X_dense[dummy_clusters.iloc[:, cluster_idx].values == 1]
        # Compute the median across cells within the cluster
        cluster_aggregated_median_X[cluster_idx, :] = np.median(cluster_cells, axis=0)

    # Convert to AnnData structure
    adataAggregated = ad.AnnData(X=cluster_aggregated_X)
    adataAggregated.layers["median"] = cluster_aggregated_median_X # we save an additional layer having as aggregated value the median of the gene expression
    adataAggregated.var_names = adataLocal.var_names
    adataAggregated.obs['kmeans_clusters'] = dummy_clusters.columns

    # Step 4: Aggregating labels and metadata
    # Get the most common cell label for each cluster
    adataAggregated.obs['AggregatedClass'] = (
        adataLocal.obs.groupby('kmeans_clusters')['cell_class']
        .agg(lambda x: x.value_counts().idxmax())
        .reindex(dummy_clusters.columns)
        .values
    )
    adataAggregated.obs['AggregatedLabel'] = (
        adataLocal.obs.groupby('kmeans_clusters')['cell_label']
        .agg(lambda x: x.value_counts().idxmax())
        .reindex(dummy_clusters.columns)
        .values
    )
    adataAggregated.obs_names = list(sample+"_"+adataAggregated.obs["kmeans_clusters"].astype(str))
    # Assign metadata fields with identical values across each cluster
    for obsMD in [col for col in adataLocal.obs.columns if len(adataLocal.obs[col].unique()) == 1]:
        adataAggregated.obs[obsMD] = adataLocal.obs[obsMD].iloc[0]

    # Add the aggregated AnnData to NewAdataList
    NewAdataList.append(adataAggregated)
XDD358_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD359_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2191_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XHU297_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD358_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2061_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2073_Cerebellum
XDD385_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2114_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD342_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD385_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2061_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XHU292_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD351_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2073_Thalamus
BRC2006_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2061_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XHU307_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2006_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2191_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2191_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD385_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2147_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
XDD334_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2057_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2110_Cerebellum
XDD351_Cortex
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2021_Thalamus
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2114_Cerebellum
Retrieving precomputed kmeans classes 

normalizing counts per cell
    finished (0:00:00)
BRC2057_Cerebellum
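The aggregation trick used in the loop (transposed one-hot cluster matrix dotted with the expression matrix) produces per-cluster sums, equivalent to a groupby-sum. A toy check:

```python
import numpy as np
import pandas as pd

# toy expression matrix: 4 cells x 2 genes, two clusters of 2 cells each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
clusters = pd.Series([0, 0, 1, 1], name="kmeans_clusters")

dummies = pd.get_dummies(clusters, dtype=float)   # cells x clusters one-hot
aggregated = dummies.T.dot(X)                     # clusters x genes, summed per cluster

# same result via an explicit groupby-sum
expected = pd.DataFrame(X).groupby(clusters).sum().to_numpy()
```

The matrix form is convenient here because the one-hot frame is also reused for the median aggregation and the label majority vote.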

Save¶

In [21]:
# Save the new aggregated anndata
from scipy import sparse
CombinedAdata = ad.concat(NewAdataList)
# Convert to sparse matrix
CombinedAdata.X = sparse.csr_matrix(CombinedAdata.X)
CombinedAdata.layers["median"] = sparse.csr_matrix(CombinedAdata.layers["median"])
In [22]:
print(adata.shape)
(45852, 15077)
In [23]:
print(CombinedAdata.shape)
(5200, 15077)
In [24]:
CombinedAdata.write_h5ad("./1_CombinedMetaCells.h5ad")

Re process¶

In [25]:
n_neighb = 40
n_pcs = 8

CombinedAdata = ad.concat(NewAdataList)
# Keep only aggregated labels represented by more than 100 metacells
AggregatedOBS = CombinedAdata.obs["AggregatedLabel"].value_counts().loc[CombinedAdata.obs["AggregatedLabel"].value_counts() > 100].index.tolist()
CombinedAdata = CombinedAdata[CombinedAdata.obs["AggregatedLabel"].isin(AggregatedOBS)]
In [26]:
import scvelo as scv
scv.tl.score_genes_cell_cycle(CombinedAdata)
calculating cell cycle phase
computing score 'S_score'
    finished: added
    'S_score', score of gene set (adata.obs).
    637 total control genes are used. (0:00:01)
computing score 'G2M_score'
    finished: added
    'G2M_score', score of gene set (adata.obs).
    475 total control genes are used. (0:00:00)
-->     'S_score' and 'G2M_score', scores of cell cycle phases (adata.obs)
In [27]:
sc.pp.normalize_total(CombinedAdata, target_sum=2e4)
sc.pp.log1p(CombinedAdata)
sc.pp.highly_variable_genes(CombinedAdata, n_top_genes=2000, flavor="seurat")
sc.pp.pca(CombinedAdata)
sc.pl.pca(CombinedAdata,color=["Donor_region","brain_region","Og_Age","Og_Chemistry"])
normalizing counts per cell
    finished (0:00:00)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
computing PCA
    with n_comps=50
    finished (0:00:00)
In [28]:
plotSankey(CombinedAdata, covs=["Donor_region","Og_Age","Og_Chemistry","Og_Subregion"])
In [29]:
HVGs = []
MinBatches = 2
for i in CombinedAdata.obs["Og_Chemistry"].unique():
    CombinedAdataKIT = CombinedAdata[CombinedAdata.obs["Og_Chemistry"] == i]
    sc.pp.highly_variable_genes(CombinedAdataKIT, n_top_genes=2000, flavor="seurat", batch_key="Og_Donor")
    # keep genes highly variable in at least MinBatches donors for this kit
    kitHVGs = CombinedAdataKIT.var_names[CombinedAdataKIT.var["highly_variable_nbatches"] >= MinBatches].tolist()
    print(len(kitHVGs))
    HVGs.extend(kitHVGs)
print(len(set(HVGs)))
CombinedAdata.var["highly_variable"] = CombinedAdata.var_names.isin(HVGs)
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
2929
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
4067
4824
In [30]:
sc.tl.pca(CombinedAdata)
sc.pl.pca_variance_ratio(CombinedAdata)
sc.pp.neighbors(CombinedAdata, n_neighbors=n_neighb, n_pcs=n_pcs, metric="euclidean")
sc.tl.umap(CombinedAdata)
computing PCA
    with n_comps=50
    finished (0:00:01)
computing neighbors
    using 'X_pca' with n_pcs = 8
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:08)

Integration¶

In [31]:
import scanpy.external as sce

CombinedAdata = CleanAdata(CombinedAdata, obstokeep=CombinedAdata.obs.columns.tolist(), obsmtokeep="X_pca")
sce.pp.harmony_integrate(CombinedAdata, key="Og_Chemistry", max_iter_harmony=20)
2024-11-15 21:05:16,382 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
2024-11-15 21:05:18,380 - harmonypy - INFO - sklearn.KMeans initialization complete.
2024-11-15 21:05:18,404 - harmonypy - INFO - Iteration 1 of 20
2024-11-15 21:05:19,173 - harmonypy - INFO - Iteration 2 of 20
2024-11-15 21:05:19,940 - harmonypy - INFO - Iteration 3 of 20
2024-11-15 21:05:20,704 - harmonypy - INFO - Iteration 4 of 20
2024-11-15 21:05:21,472 - harmonypy - INFO - Iteration 5 of 20
2024-11-15 21:05:22,236 - harmonypy - INFO - Iteration 6 of 20
2024-11-15 21:05:23,002 - harmonypy - INFO - Iteration 7 of 20
2024-11-15 21:05:23,766 - harmonypy - INFO - Iteration 8 of 20
2024-11-15 21:05:24,540 - harmonypy - INFO - Iteration 9 of 20
2024-11-15 21:05:25,067 - harmonypy - INFO - Iteration 10 of 20
2024-11-15 21:05:25,707 - harmonypy - INFO - Converged after 10 iterations
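Harmony's actual algorithm alternates soft k-means clustering with cluster-specific linear corrections of the PC embeddings; the log above shows it converging in 10 iterations. As a much cruder intuition only (this is not what `harmony_integrate` does), a constant batch offset in an embedding can be removed by re-centering each batch at the global mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy 1-D embedding: two batches with the same biology but a constant offset
emb = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
batch = np.array(["v2"] * 100 + ["v3"] * 100)

corrected = emb.copy()
for b in np.unique(batch):
    mask = batch == b
    # shift this batch so its mean matches the global mean
    corrected[mask] += emb.mean() - emb[mask].mean()

gap_before = abs(emb[batch == "v3"].mean() - emb[batch == "v2"].mean())
gap_after = abs(corrected[batch == "v3"].mean() - corrected[batch == "v2"].mean())
```

Harmony improves on this naive scheme by correcting within clusters, so shared cell types are mixed without erasing genuine composition differences between batches.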
In [32]:
sc.pp.neighbors(CombinedAdata, n_neighbors=n_neighb, n_pcs=n_pcs, use_rep="X_pca_harmony")
sc.tl.umap(CombinedAdata)
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:08)
In [33]:
sc.tl.leiden(CombinedAdata, resolution=.8)
sc.tl.rank_genes_groups(CombinedAdata,groupby="leiden", method="wilcoxon")
running Leiden clustering
    finished: found 16 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:02)
ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:14)
In [34]:
sc.pl.umap(CombinedAdata, color=["leiden","Og_Subregion","AggregatedClass"], 
              size=10, wspace=.4, ncols=2, vmin='p1', vmax='p99', 
              add_outline=True, edges=False, edges_width=.01)
In [35]:
plotSankey(CombinedAdata, covs=["leiden","AggregatedClass"])

MAGIC¶

In [36]:
import palantir
dm_res = palantir.utils.run_diffusion_maps(CombinedAdata, knn=n_neighb, pca_key="X_pca_harmony", n_components=n_pcs)
ms_data = palantir.utils.determine_multiscale_space(CombinedAdata)
palantir.utils.run_magic_imputation(CombinedAdata)
computing neighbors
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:00)
Out[36]:
array([[0.02159585, 0.54069483, 0.02333469, ..., 0.06535586, 0.53239476,
        0.57256068],
       [0.02731917, 0.50445414, 0.        , ..., 0.01208675, 0.1718003 ,
        0.02736146],
       [0.03031441, 0.69590421, 0.01278159, ..., 0.04684644, 0.47414945,
        0.170866  ],
       ...,
       [0.02344872, 0.65439312, 0.02709426, ..., 0.14571777, 0.43550936,
        0.18831748],
       [0.02160011, 0.47292763, 0.07538638, ..., 0.1829114 , 0.43291091,
        0.68595744],
       [0.01736921, 0.48722275, 0.06338858, ..., 0.18861574, 0.42059673,
        0.68379031]])
In [37]:
import decoupler as dc
CombinedAdata.X = CombinedAdata.layers["MAGIC_imputed_data"].copy()

# Download database of regulons
net = dc.get_collectri(organism='human', split_complexes=False)
net
Out[37]:
source target weight PMID
0 MYC TERT 1 10022128;10491298;10606235;10637317;10723141;1...
1 SPI1 BGLAP 1 10022617
2 SMAD3 JUN 1 10022869;12374795
3 SMAD4 JUN 1 10022869;12374795
4 STAT5A IL2 1 10022878;11435608;17182565;17911616;22854263;2...
... ... ... ... ...
43173 NFKB hsa-miR-143-3p 1 19472311
43174 AP1 hsa-miR-206 1 19721712
43175 NFKB hsa-miR-21-5p 1 20813833;22387281
43176 NFKB hsa-miR-224-5p 1 23474441;23988648
43177 AP1 hsa-miR-144 1 23546882

43178 rows × 4 columns

In [38]:
dc.run_ulm(
    mat=CombinedAdata,
    net=net,use_raw=False, 
    source='source',
    target='target',
    weight='weight',
    verbose=True
)
Running ulm on mat with 5149 samples and 15077 targets for 636 sources.
In [39]:
acts = dc.get_acts(CombinedAdata, obsm_key='ulm_estimate')
acts
Out[39]:
AnnData object with n_obs × n_vars = 5149 × 636
    obs: 'kmeans_clusters', 'AggregatedClass', 'AggregatedLabel', 'Og_Subregion', 'Og_Donor', 'Og_Age', 'Og_Week', 'Og_Chemistry', 'run_id', 'brain_region', 'scDblFinder.classs', 'scDblFinder.class', 'Donor_region', 'S_score', 'G2M_score', 'phase', 'leiden'
    uns: 'Donor_region_colors', 'brain_region_colors', 'Og_Chemistry_colors', 'neighbors', 'umap', 'leiden', 'rank_genes_groups', 'leiden_colors', 'Og_Subregion_colors', 'AggregatedClass_colors', 'DM_EigenValues'
    obsm: 'X_pca', 'X_pca_harmony', 'X_umap', 'DM_EigenVectors', 'DM_EigenVectors_multiscaled', 'ulm_estimate', 'ulm_pvals'
In [40]:
acts.obs.AggregatedClass.unique().tolist()
Out[40]:
['Radial_glia_Cortex_Pre',
 'Neuron_Cortex_Pre',
 'Neuroblast_Cortex_Pre',
 'Glioblast_Cortex_Pre',
 'Neuronal_IPC_Cortex_Pre',
 'Neuronal_IPC_Thalamus_Pre',
 'Neuron_Thalamus_Pre',
 'Radial_glia_Thalamus_Pre',
 'Neuroblast_Thalamus_Pre',
 'Neuron_Cerebellum_Pre',
 'Neuroblast_Cerebellum_Pre',
 'Glioblast_Cerebellum_Pre',
 'Neuronal_IPC_Cerebellum_Pre',
 'Radial_glia_Cerebellum_Pre',
 'Glioblast_Thalamus_Pre']
In [41]:
acts_RG = acts[acts.obs["AggregatedClass"].isin(["Radial_glia_Cortex_Pre","Radial_glia_Thalamus_Pre","Radial_glia_Cerebellum_Pre"])].copy()
df = dc.rank_sources_groups(acts_RG, groupby='AggregatedClass', reference="rest", method='t-test_overestim_var')
n_markers = 10
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers
sc.tl.dendrogram(acts_RG, groupby="AggregatedClass")
sc.pl.matrixplot(acts_RG, source_markers, 'AggregatedClass', dendrogram=True, standard_scale='var',
                 colorbar_title='Z-scaled scores', cmap='RdBu_r')
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_AggregatedClass']`
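The `groupby('group').head(n_markers)` chain used to build `source_markers` takes the top-n rows per group (the ranking frame is already sorted within groups) and collects the names into a dict. On a toy ranking frame:

```python
import pandas as pd

# toy rank_sources_groups-style output, already sorted by score within each group
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "names": ["TF1", "TF2", "TF3", "TF4", "TF5", "TF6"],
    "score": [9.0, 7.0, 5.0, 8.0, 6.0, 4.0],
})

n_markers = 2
source_markers = (
    df.groupby("group").head(n_markers)   # top-n rows per group
      .groupby("group")["names"]
      .apply(list)                        # collect names into a list per group
      .to_dict()
)
```

Note that `head` relies on the within-group ordering, so this only yields true top markers when the frame is pre-sorted, as `rank_sources_groups` output is.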
In [42]:
acts_NEU = acts[acts.obs["AggregatedClass"].isin(["Neuron_Cortex_Pre","Neuron_Thalamus_Pre","Neuron_Cerebellum_Pre"])].copy()
df = dc.rank_sources_groups(acts_NEU, groupby='AggregatedClass', reference="rest", method='t-test_overestim_var')
n_markers = 10
source_markers = df.groupby('group').head(n_markers).groupby('group')['names'].apply(lambda x: list(x)).to_dict()
source_markers
sc.tl.dendrogram(acts_NEU, groupby="AggregatedClass")
sc.pl.matrixplot(acts_NEU, source_markers, 'AggregatedClass', dendrogram=True, standard_scale='var',
                 colorbar_title='Z-scaled scores', cmap='RdBu_r')
    using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_AggregatedClass']`
In [44]:
sc.pl.rank_genes_groups(CombinedAdata, n_genes=25, ncols=4, fontsize=8)